feat(messages): add native Anthropic Messages API (/v1/messages) #5386
franciscojavierarceo merged 9 commits into llamastack:main
Conversation
going to add integration tests here too since ollama is compatible
This pull request has merge conflicts that must be resolved before it can be merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Add the API layer for the Anthropic Messages API (/v1/messages). This includes the Messages protocol definition, Pydantic models for all Anthropic request/response types (content blocks, streaming events, tool use, thinking), and FastAPI routes with Anthropic-specific SSE streaming format. Also registers the "messages" logging category and adds Api.messages to the Api enum. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Charlie Doern <cdoern@redhat.com>
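The Anthropic-specific SSE streaming this commit describes uses named events rather than bare `data:` frames. A minimal stdlib-only sketch of that framing (the `sse_event` helper is hypothetical; the event names follow Anthropic's public streaming protocol, not this PR's exact serializer):

```python
import json

# Each frame carries an explicit "event:" line naming the event type,
# followed by a "data:" line with the JSON payload, then a blank line.
def sse_event(name: str, payload: dict) -> str:
    return f"event: {name}\ndata: {json.dumps(payload)}\n\n"

frames = [
    sse_event("message_start", {"type": "message_start",
                                "message": {"role": "assistant", "content": []}}),
    sse_event("content_block_delta", {"type": "content_block_delta", "index": 0,
                                      "delta": {"type": "text_delta", "text": "Hi"}}),
    sse_event("message_stop", {"type": "message_stop"}),
]
stream = "".join(frames)
print(stream.splitlines()[0])  # event: message_start
```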
…ive passthrough Add the single BuiltinMessagesImpl provider that translates Anthropic Messages format to/from OpenAI Chat Completions, delegating to the inference API. For providers that natively support /v1/messages (e.g. Ollama), requests are forwarded directly without translation. Also registers the provider in the registry, wires the router in the server, and adds Messages to the protocol map in the resolver. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Charlie Doern <cdoern@redhat.com>
…ions Add the messages provider (inline::builtin) to the starter distribution template and regenerate configs for starter and ci-tests distributions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Charlie Doern <cdoern@redhat.com>
Add 17 unit tests covering request translation, response translation, and streaming translation. Regenerate OpenAPI specs, provider docs, and Stainless SDK config to include the new /v1/messages endpoints. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Charlie Doern <cdoern@redhat.com>
Add a new messages integration test suite that exercises the Anthropic Messages API (/v1/messages) end-to-end through the server. The suite includes 13 tests covering non-streaming, streaming, system prompts, multi-turn conversations, tool definitions, tool use round trips, content block arrays, error handling, and response headers. To enable replay mode (no live backend required), extend the api_recorder to patch httpx.AsyncClient.post and httpx.AsyncClient.stream. This captures the native Ollama passthrough requests the Messages provider makes via raw httpx, following the same pattern used for aiohttp rerank recording. Recordings are stored in tests/integration/messages/recordings/. Also fix pre-commit violations: structured logging in impl.py, unused loop variable, and remove redundant @pytest.mark.asyncio decorators from unit tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Charlie Doern <cdoern@redhat.com>
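The httpx-level record/replay extension described above can be illustrated with a stdlib-only sketch. The `Client` class, hashing scheme, and function names below are stand-ins, not the real `api_recorder` code; the point is the pattern of swapping the `post` method to record in one mode and replay in the other:

```python
import asyncio
import hashlib
import json
from unittest import mock

class Client:  # stand-in for httpx.AsyncClient
    async def post(self, url, json=None):
        return {"live": True, "url": url}

recordings = {}

def key(url, body):
    # deterministic key derived from the request, so replay can match it
    return hashlib.sha256(json.dumps([url, body], sort_keys=True).encode()).hexdigest()

_real_post = Client.post

async def recording_post(self, url, json=None):
    # record mode: make the live call, then store the response
    resp = await _real_post(self, url, json=json)
    recordings[key(url, json)] = resp
    return resp

async def replay_post(self, url, json=None):
    # replay mode: serve the stored response, no live backend required
    return recordings[key(url, json)]

async def demo():
    with mock.patch.object(Client, "post", recording_post):
        await Client().post("http://localhost:11434/v1/messages", json={"x": 1})
    with mock.patch.object(Client, "post", replay_post):
        return await Client().post("http://localhost:11434/v1/messages", json={"x": 1})

resp = asyncio.run(demo())
print(resp["live"])  # True
```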
…pruned The cleanup_recordings.py script uses ci_matrix.json to determine which test suites are active. Without the messages suite listed, the script considers all messages recordings unused and deletes them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Charlie Doern <cdoern@redhat.com>
skamenan7
left a comment
Thanks for this PR allowing llamastack to work with Claude Code. Great addition!
```python
# -- Native passthrough for providers with /v1/messages support --

# Module paths of provider impls known to support /v1/messages natively
_NATIVE_MESSAGES_MODULES = {"llama_stack.providers.remote.inference.ollama"}
```
I noticed the native passthrough path hardcodes provider module paths and reaches into routing table internals that currently only fire for Ollama. Would it make sense to remove this path for now and let Ollama go through the translation layer like other providers? If the native path is important for thinking blocks, it might fit better as a dedicated remote provider rather than a fork inside inline::builtin.
```python
    openai_client,
    require_server,
)
```
question, the tests skip library client mode. is that intentional for the translation path too, or just the native passthrough? since the translation path delegates to Api.inference, I'd have thought library mode might work there.
```python
async with httpx.AsyncClient() as client:
    resp = await client.post(url, json=body, headers=headers, timeout=300)
    resp.raise_for_status()
```
seems resp.raise_for_status() throws httpx.HTTPStatusError, which the route handler doesn't specifically catch, so it falls into the generic handler and becomes a 500. checked the empty-messages recording and Ollama is actually returning a 400 while the test lets 500 through. catching httpx.HTTPStatusError and re-raising as HTTPException with the original status code may help.
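A sketch of the suggested fix, using stand-in exception classes that mirror httpx.HTTPStatusError and fastapi.HTTPException (the names and signatures here are simplified stand-ins, not the real library APIs):

```python
class UpstreamStatusError(Exception):  # stand-in for httpx.HTTPStatusError
    def __init__(self, status_code: int, text: str):
        super().__init__(text)
        self.status_code = status_code
        self.text = text

class HTTPException(Exception):  # stand-in for fastapi.HTTPException
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

def passthrough(ok: bool) -> dict:
    try:
        if not ok:
            # e.g. Ollama rejecting an empty messages array with a 400
            raise UpstreamStatusError(400, "messages must not be empty")
        return {"ok": True}
    except UpstreamStatusError as e:
        # re-raise with the provider's original status instead of a generic 500
        raise HTTPException(status_code=e.status_code, detail=e.text) from e

try:
    passthrough(False)
except HTTPException as e:
    caught_status = e.status_code

print(caught_status)  # 400
```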
```python
    body: dict[str, Any],
) -> AsyncIterator[AnthropicStreamEvent]:
    """Stream SSE events directly from the provider."""
    async with httpx.AsyncClient() as client:
```
I noticed _passthrough_stream is an async generator, which means the async with httpx.AsyncClient() block only executes when the caller starts iterating — which happens in the SSE serializer, downstream of _passthrough_request's error handling. Would it make sense to eagerly establish the connection first (following the two-step unwrap pattern) so a 4xx from Ollama surfaces as a clean error rather than a mid-stream exception after the 200 is already sent?
Claude Code sends thinking.type: "adaptive" which was not accepted by the Messages API model. Add "adaptive" as a valid thinking type literal and treat it the same as "enabled" in the translation path. Regenerate OpenAPI specs to reflect the schema change. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Charlie Doern <cdoern@redhat.com>
@mattf with the most recent changes:
🤩
Addressed reviewer feedback that each passthrough request created a new httpx.AsyncClient, incurring TCP overhead. A shared client is now created in initialize() and reused for both normal and streaming passthrough calls. The implementation also removes an unused client variable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Charlie Doern <cdoern@redhat.com>
both passthrough paths create a new httpx.AsyncClient() on each request. Would it make sense to create a shared client at initialize() and reuse it? Per-request clients mean a new TCP connection for every call.

my most recent commit was generated using claude connected to llama stack running this patch: server logs show a few errors here and there, and gpt-oss seems to be a little less proficient than opus, but overall it was able to do it!
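The shared-client lifecycle fixed in that commit can be sketched with a stand-in client class (not real httpx; the counter just makes the reuse visible). With real httpx, keeping one AsyncClient alive preserves its connection pool across requests:

```python
import asyncio

class FakeAsyncClient:  # stand-in for httpx.AsyncClient
    instances = 0
    def __init__(self):
        FakeAsyncClient.instances += 1
    async def post(self, url, json=None):
        return {"url": url, "ok": True}
    async def aclose(self):
        pass

class MessagesImpl:
    def __init__(self):
        self._client = None
    async def initialize(self):
        # one shared client, created once and reused by every passthrough call
        self._client = FakeAsyncClient()
    async def _passthrough_request(self, url, body):
        return await self._client.post(url, json=body)
    async def shutdown(self):
        await self._client.aclose()

async def demo():
    impl = MessagesImpl()
    await impl.initialize()
    for _ in range(3):
        await impl._passthrough_request("http://localhost:11434/v1/messages", {})
    await impl.shutdown()
    return FakeAsyncClient.instances

instances = asyncio.run(demo())
print(instances)  # 1 -- one client for three requests
```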
@cdoern can you try it just with OpenAI gpt-5?
this also works with gpt-4o translating via chat/completions: the 404s are strange-looking errors caused by Claude Code's Anthropic SDK sending background requests (token counting, etc.) using hardcoded Anthropic model names like claude-haiku-4-5-20251001, even when the main conversation is configured to use openai/gpt-4o. The llama-stack server doesn't have those Anthropic models registered, so those background requests fail with ModelNotFoundError. These errors are harmless; they don't affect your actual conversation, which routes through openai/gpt-4o successfully. Claude Code handles the failures gracefully on its end (token counting is best-effort).
… OpenAI translation Single text-part user messages were sent as a bare dict instead of a string, causing Pydantic validation errors. The Anthropic thinking parameter was also forwarded to OpenAI chat completions which does not support it. Additionally, ModelNotFoundError now returns a clean 404 instead of a 500 traceback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Charlie Doern <cdoern@redhat.com>
mattf
left a comment
we'll need a way to register model names that claude expects to find, even if we route them to another implementation
FYI you can set the default model strings used for Haiku / Sonnet / Opus via env variables. Then, for any task that needs one of those models, it uses your default string as opposed to its hardcoded ones.
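As a hedged illustration: the variable names below match Claude Code's documented model-override settings at the time of writing, but verify them against current Claude Code docs; the base URL and model strings are examples, not values from this PR.

```shell
# Point Claude Code at the llama-stack server (8321 is llama-stack's default port)
export ANTHROPIC_BASE_URL="http://localhost:8321"

# Override the hardcoded Haiku/Sonnet/Opus defaults with registered models
export ANTHROPIC_DEFAULT_HAIKU_MODEL="openai/gpt-4o-mini"
export ANTHROPIC_DEFAULT_SONNET_MODEL="openai/gpt-4o"
export ANTHROPIC_DEFAULT_OPUS_MODEL="openai/gpt-5"
```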

Summary
- New `/v1/messages` endpoint implementing the Anthropic Messages API, enabling llama-stack to serve as a drop-in backend for Claude Code, Codex CLI, and other Anthropic-protocol clients
- New `inline::builtin` provider (`BuiltinMessagesImpl`) that depends on `Api.inference` and works with all inference backends automatically
- For providers that natively support `/v1/messages` (e.g. Ollama), requests are forwarded directly without translation, preserving full fidelity (thinking blocks, native streaming, etc.)

What's included
- API layer (`src/llama_stack_api/messages/`): Protocol, Pydantic models for all Anthropic types (content blocks, streaming events, tool use, thinking), FastAPI routes with Anthropic-specific named SSE events
- Provider (`src/llama_stack/providers/inline/messages/`): Translation layer (request/response/streaming) + native passthrough for Ollama
- Recordings (`tests/integration/messages/recordings/`): httpx-level record-replay for the Messages API native passthrough path, enabling CI replay without a live backend
- Recorder (`src/llama_stack/testing/api_recorder.py`): Added `httpx.AsyncClient` patching (`post` and `stream`) to record/replay raw httpx requests used by the native passthrough, complementing the existing OpenAI client patching

Translation map
| Anthropic | OpenAI |
| --- | --- |
| `system` (top-level) | `messages[0]` with `role=system` |
| `tool_use` block | `tool_calls` on assistant msg |
| `tool_result` block | `role: "tool"` message |
| `tool_choice: "any"` | `tool_choice: "required"` |
| `stop_sequences` | `stop` |
| `stop_reason: "end_turn"` | `finish_reason: "stop"` |
| `stop_reason: "tool_use"` | `finish_reason: "tool_calls"` |

Test plan
- `uv run pytest tests/unit/providers/inline/messages/ -x --tb=short -v` (17/17 passing)
- `uv run pre-commit run mypy --all-files` (passes)
- Integration tests (`tests/integration/messages/`, `messages` suite in CI matrix)

Generated with Claude Code